This short R Markdown file accompanies Section 3 of the course notes. We will use the function trendscatter from the package s20x, so you will need to install that package before working through these examples. One of the examples also uses the pairs.plus function, which is contained in the Regression.RData file I e-mailed you. To have access to pairs.plus, read that file into R by selecting Open File… from the File drop-down menu.
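The same setup can also be done from the R console. The sketch below is only illustrative: the install line is left commented out, and demo_fun and Demo.RData are hypothetical names used to show how load() restores objects from an .RData file. The final commented line assumes Regression.RData sits in your working directory, which may not match where you saved it.

```r
# One-time install of the s20x package (uncomment to run):
# install.packages("s20x")

# load() restores every object stored in an .RData file. Demonstrated here
# with a throwaway file; demo_fun and Demo.RData are hypothetical names.
demo_fun <- function(x) x + 1
demo_path <- file.path(tempdir(), "Demo.RData")
save(demo_fun, file = demo_path)
rm(demo_fun)

restored <- load(demo_path)   # returns the names of the restored objects
restored
exists("demo_fun")

# For the course file, the analogous call would be (path is an assumption):
# load("Regression.RData")
```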

Example 3.1 - Paddlefish

We begin by reading in the paddlefish data used in the first part of the Section 3 notes.

Paddle = read.csv("http://course1.winona.edu/bdeppa/Regression/Data/Paddlefish%20(clean).csv",header=T)
names(Paddle)
## [1] "Age"    "Length" "Weight"
head(Paddle)
##   Age Length Weight
## 1   6  87.63   2.58
## 2   5  87.63   3.32
## 3   5  89.54   3.35
## 4   3  67.31   1.69
## 5   4  77.47   2.55
## 6   6 100.33   4.58
str(Paddle)
## 'data.frame':    183 obs. of  3 variables:
##  $ Age   : int  6 5 5 3 4 6 7 7 8 7 ...
##  $ Length: num  87.6 87.6 89.5 67.3 77.5 ...
##  $ Weight: num  2.58 3.32 3.35 1.69 2.55 4.58 5.34 4.49 4.53 5.41 ...
summary(Paddle)
##       Age             Length           Weight      
##  Min.   : 1.000   Min.   : 43.82   Min.   : 0.250  
##  1st Qu.: 4.000   1st Qu.: 83.19   1st Qu.: 2.570  
##  Median : 5.000   Median : 90.81   Median : 3.470  
##  Mean   : 5.541   Mean   : 90.86   Mean   : 4.248  
##  3rd Qu.: 7.000   3rd Qu.:100.65   3rd Qu.: 5.115  
##  Max.   :18.000   Max.   :140.97   Max.   :20.300

We now use the trendscatter function in the s20x package to construct a scatterplot with smoothers added to visualize \(\small{E(Weight|Length)}\) and \(\small{SD(Weight|Length)}\).

require(s20x)
## Loading required package: s20x
trendscatter(Weight~Length,data=Paddle)
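To make the idea of smoothing both the mean and the spread concrete, here is a minimal sketch using base R's lowess on simulated data (not the paddlefish data). Smoothing the squared residuals is one common way to estimate a conditional variance; it is an assumption for illustration and not necessarily what trendscatter does internally.

```r
# Simulated data (not the paddlefish data): curved mean, spread grows with x.
set.seed(1)
x <- sort(runif(300, 40, 140))
y <- 2e-6 * x^3 + rnorm(300, sd = 0.01 * x)

# Step 1: smooth y against x to estimate the conditional mean E(y | x).
mean_fit <- lowess(x, y, f = 0.5)

# Step 2: smooth the squared residuals to estimate Var(y | x), then take
# the square root to get SD(y | x). pmax() guards against small negatives.
sq_resid <- (y - mean_fit$y)^2
var_fit  <- lowess(x, sq_resid, f = 0.5)
sd_fit   <- sqrt(pmax(var_fit$y, 0))

# Scatterplot with the mean smooth and a band of +/- one estimated SD.
plot(x, y, col = "grey40")
lines(mean_fit, lwd = 2)
lines(mean_fit$x, mean_fit$y + sd_fit, lty = 2)
lines(mean_fit$x, mean_fit$y - sd_fit, lty = 2)
```

A band that widens from left to right is the visual signature of non-constant variance.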

We can clearly see that \(\small{E(Weight|Length)}\) is nonlinear, i.e. exhibits curvature, and that \(\small{Var(Weight|Length)}\), and hence \(\small{SD(Weight|Length)}\), is NOT constant. We can explore the effect of window width on the smoothing process by varying the fraction of the observations used in each window. The default setting in the trendscatter function is f = 0.50, i.e. use 50% of the data in each window. Keep in mind that the weighting function within each window downweights points the farther they are from the target point.
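The downweighting within a window can be sketched with the tricube weight function, which is the weight used by base R's lowess and loess. That trendscatter uses exactly these weights is an assumption here, and the lengths and target point below are hypothetical values chosen for illustration.

```r
# Tricube weight: 1 at the target point, falling smoothly to 0 at the
# edge of the window (u = scaled distance, so |u| >= 1 is outside).
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Hypothetical lengths near a target point of 90 cm.
x       <- c(70, 80, 85, 90, 95, 100, 110)
target  <- 90
f       <- 0.75                      # fraction of points in the window
k       <- ceiling(f * length(x))    # number of points in the window
dists   <- abs(x - target)
h       <- sort(dists)[k]            # half-width: distance to k-th nearest point
weights <- tricube(dists / h)
round(weights, 3)
```

Points at the target get full weight, points closer to the window edge get progressively less, and points at or beyond the edge get none.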

trendscatter(Weight~Length,data=Paddle,f=0.10)    # Too noisy!

trendscatter(Weight~Length,data=Paddle,f=0.25)

trendscatter(Weight~Length,data=Paddle,f=0.75)

trendscatter(Weight~Length,data=Paddle,f=1)  # Too smooth!
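The visual impression of "too noisy" versus "too smooth" can also be put into numbers. The sketch below uses simulated data and base R's lowess rather than trendscatter, and the roughness measure (sum of squared second differences of the fitted curve) is just one illustrative choice.

```r
# Simulated data on an even grid: a sine curve plus noise.
set.seed(2)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.4)

# Roughness measure (an illustrative choice): sum of squared second
# differences of the fitted curve. Larger values mean a wigglier smooth.
roughness <- function(f) {
  fit <- lowess(x, y, f = f, delta = 0)   # delta = 0: fit at every point
  sum(diff(fit$y, differences = 2)^2)
}

fractions <- c(0.10, 0.25, 0.50, 0.75, 1)
rough <- sapply(fractions, roughness)
data.frame(f = fractions, roughness = rough)
```

Small fractions chase the noise and give a rough curve, while large fractions flatten out real structure, so the useful choices usually lie in between.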

Example 3.2 - Abalone

These data contain measurements and ages (rings) of 4,175 abalones. We will use these data later in the course. The function pairs.plus in the Regression.RData file creates a scatterplot matrix with histograms and smoothers added, so we can visualize the distribution of each variable and the relationship between each pair of variables in these data.

Abalone = read.csv("http://course1.winona.edu/bdeppa/Regression/Data/abalone.csv")
pairs.plus(Abalone)
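If pairs.plus is not available, a similar display can be approximated with base R. This is a sketch and not the pairs.plus implementation: the panel.hist helper is adapted from the examples in the pairs help page, and the built-in iris data are used as a stand-in for the abalone data.

```r
# Diagonal panel: draw a histogram of each variable (adapted from ?pairs).
panel.hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))            # restore plot coordinates afterwards
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  y <- h$counts / max(h$counts)      # rescale bar heights to [0, 1]
  nB <- length(h$breaks)
  rect(h$breaks[-nB], 0, h$breaks[-1], y, col = "grey80")
}

# Built-in data set used as a stand-in for the abalone data.
num_vars <- iris[, 1:4]

# panel.smooth adds a lowess curve to each off-diagonal scatterplot.
pairs(num_vars, panel = panel.smooth, diag.panel = panel.hist)
```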

Example 3.3 - Breast Cancer Cells

In this example we examine cell perimeter and cell area as functions of cell radius. We will also conduct these investigations conditional on the tumor type, i.e. malignant (M) or benign (B). First we read in and inspect the dataset.

BreastDiag = read.csv("http://course1.winona.edu/bdeppa/Regression/Data/BreastDiag.txt",header=T)
names(BreastDiag)
##  [1] "Id"          "Diagnosis"   "Radius"      "Texture"     "Perimeter"  
##  [6] "Area"        "Smoothness"  "Compactness" "Concavity"   "ConcavePts" 
## [11] "Symmetry"    "FracDim"     "serad"       "setex"       "seperi"     
## [16] "searea"      "sesmoo"      "secomp"      "seconc"      "seconpts"   
## [21] "sesym"       "sefd"        "wrad"        "wtex"        "wperi"      
## [26] "warea"       "wsmoo"       "wcomp"       "wconc"       "wconpts"    
## [31] "wsym"        "wfd"
head(BreastDiag)
##         Id Diagnosis Radius Texture Perimeter   Area Smoothness
## 1   842302         M  17.99   10.38    122.80 1001.0    0.11840
## 2   842517         M  20.57   17.77    132.90 1326.0    0.08474
## 3 84300903         M  19.69   21.25    130.00 1203.0    0.10960
## 4 84348301         M  11.42   20.38     77.58  386.1    0.14250
## 5 84358402         M  20.29   14.34    135.10 1297.0    0.10030
## 6   843786         M  12.45   15.70     82.57  477.1    0.12780
##   Compactness Concavity ConcavePts Symmetry FracDim  serad  setex seperi
## 1     0.27760    0.3001    0.14710   0.2419 0.07871 1.0950 0.9053  8.589
## 2     0.07864    0.0869    0.07017   0.1812 0.05667 0.5435 0.7339  3.398
## 3     0.15990    0.1974    0.12790   0.2069 0.05999 0.7456 0.7869  4.585
## 4     0.28390    0.2414    0.10520   0.2597 0.09744 0.4956 1.1560  3.445
## 5     0.13280    0.1980    0.10430   0.1809 0.05883 0.7572 0.7813  5.438
## 6     0.17000    0.1578    0.08089   0.2087 0.07613 0.3345 0.8902  2.217
##   searea   sesmoo  secomp  seconc seconpts   sesym     sefd  wrad  wtex
## 1 153.40 0.006399 0.04904 0.05373  0.01587 0.03003 0.006193 25.38 17.33
## 2  74.08 0.005225 0.01308 0.01860  0.01340 0.01389 0.003532 24.99 23.41
## 3  94.03 0.006150 0.04006 0.03832  0.02058 0.02250 0.004571 23.57 25.53
## 4  27.23 0.009110 0.07458 0.05661  0.01867 0.05963 0.009208 14.91 26.50
## 5  94.44 0.011490 0.02461 0.05688  0.01885 0.01756 0.005115 22.54 16.67
## 6  27.19 0.007510 0.03345 0.03672  0.01137 0.02165 0.005082 15.47 23.75
##    wperi  warea  wsmoo  wcomp  wconc wconpts   wsym     wfd
## 1 184.60 2019.0 0.1622 0.6656 0.7119  0.2654 0.4601 0.11890
## 2 158.80 1956.0 0.1238 0.1866 0.2416  0.1860 0.2750 0.08902
## 3 152.50 1709.0 0.1444 0.4245 0.4504  0.2430 0.3613 0.08758
## 4  98.87  567.7 0.2098 0.8663 0.6869  0.2575 0.6638 0.17300
## 5 152.20 1575.0 0.1374 0.2050 0.4000  0.1625 0.2364 0.07678
## 6 103.40  741.6 0.1791 0.5249 0.5355  0.1741 0.3985 0.12440

Next we investigate how cell area and cell perimeter each relate to the radius. For both we consider the theoretical relationship that holds if tumor cells are circular/spherical and compare it with the relationship obtained by scatterplot smoothing. We also examine these relationships conditional on the tumor type, i.e. benign or malignant.

The theoretical relationships are as follows:

Area: \(\small Area = \pi (Radius)^2\)

Perimeter: \(\small Perimeter = 2\pi (Radius)\)
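As a quick numerical check of these formulas, the sketch below applies them to the six rows printed by head(BreastDiag) above. The values are copied by hand from that output, and the column names AreaTheory, PerimTheory, AreaRatio, and PerimRatio are introduced here only for illustration.

```r
# First six rows of Radius, Area, and Perimeter, copied from head(BreastDiag).
cells <- data.frame(
  Radius    = c(17.99, 20.57, 19.69, 11.42, 20.29, 12.45),
  Area      = c(1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1),
  Perimeter = c(122.80, 132.90, 130.00, 77.58, 135.10, 82.57)
)

# Theoretical values for a perfect circle with the same radius.
cells$AreaTheory  <- pi * cells$Radius^2
cells$PerimTheory <- 2 * pi * cells$Radius

# Ratios near 1 mean the circular model fits well.
cells$AreaRatio  <- cells$Area / cells$AreaTheory
cells$PerimRatio <- cells$Perimeter / cells$PerimTheory
round(cells, 3)
```

In these six rows the observed perimeter is larger than \(\small 2\pi (Radius)\) every time, while the observed area falls on both sides of \(\small \pi (Radius)^2\). Six rows are far too few to draw conclusions, which is why the plots that follow use the full dataset.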

Radius and Area

trendscatter(Area~Radius,data=BreastDiag)

plot(Area~Radius,data=BreastDiag,xlab="Cell Radius",ylab="Cell Area",main="Comparing LOESS Smooth to Theoretical Model")
lines(sort(BreastDiag$Radius),pi*sort(BreastDiag$Radius^2),lty=2,col="red",lwd=3)
lines(lowess(BreastDiag$Radius,BreastDiag$Area,f=.2),lty=3,col="blue",lwd=3)
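The theoretical curve above is drawn by sorting the radii so that lines connects the points from left to right. An alternative is base R's curve function, which evaluates an expression on an even grid and needs no sorting. The sketch below uses simulated radii and areas as stand-ins for the BreastDiag columns.

```r
# Simulated radii and areas standing in for BreastDiag$Radius and $Area.
set.seed(3)
radius <- runif(50, 7, 28)
area   <- pi * radius^2 * exp(rnorm(50, sd = 0.05))

plot(radius, area, xlab = "Cell Radius", ylab = "Cell Area")

# curve() evaluates the expression on an even grid of x values, so no
# sorting is needed; add = TRUE draws on the existing plot.
theory <- curve(pi * x^2, from = min(radius), to = max(radius),
                add = TRUE, lty = 2, col = "red", lwd = 3)
```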

trendscatter(Area~Radius,data=BreastDiag[BreastDiag$Diagnosis=="B",],main="Benign Cells")

trendscatter(Area~Radius,data=BreastDiag[BreastDiag$Diagnosis=="M",],main="Malignant Cells")
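Comparing two separate plots by eye can be difficult because their axes differ. One option is to overlay a smooth for each tumor type on a single set of axes. The sketch below uses simulated data (not the real measurements) and base R's lowess in place of trendscatter; all variable names are stand-ins.

```r
# Simulated stand-in for Radius, Area, and Diagnosis (not the real data).
set.seed(4)
n <- 200
diagnosis <- factor(sample(c("B", "M"), n, replace = TRUE))
radius <- ifelse(diagnosis == "M", runif(n, 11, 28), runif(n, 7, 18))
area   <- pi * radius^2 * exp(rnorm(n, sd = 0.05))

plot(radius, area, col = ifelse(diagnosis == "M", "red", "blue"),
     xlab = "Cell Radius", ylab = "Cell Area")

# Fit one lowess curve per diagnosis group and draw both on the same axes.
fits <- lapply(split(data.frame(radius, area), diagnosis),
               function(d) lowess(d$radius, d$area, f = 0.5))
lines(fits$B, col = "blue", lwd = 3)
lines(fits$M, col = "red", lwd = 3, lty = 2)
legend("topleft", legend = c("Benign", "Malignant"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 3)
```

With the real data, the same pattern would apply after splitting the Radius and Area columns of BreastDiag by Diagnosis.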

How does the relationship between cell area and cell radius differ between benign and malignant tumor cells?

Radius and Perimeter

trendscatter(Perimeter~Radius,data=BreastDiag)

plot(Perimeter~Radius,data=BreastDiag,xlab="Cell Radius",ylab="Cell Perimeter",main="Comparing LOESS Smooth to Theoretical Model")
lines(sort(BreastDiag$Radius),2*pi*sort(BreastDiag$Radius),lty=2,col="red",lwd=3)
lines(lowess(BreastDiag$Radius,BreastDiag$Perimeter,f=.2),lty=3,col="blue",lwd=3)

trendscatter(Perimeter~Radius,data=BreastDiag[BreastDiag$Diagnosis=="B",],main="Benign Cells")

trendscatter(Perimeter~Radius,data=BreastDiag[BreastDiag$Diagnosis=="M",],main="Malignant Cells")

How does the relationship between cell perimeter and cell radius differ between benign and malignant tumor cells?